Conversation
openfold3/setup_openfold.py
Outdated
|
|
||
| def setup_biotite_ccd() -> None: | ||
| logger.info("Starting Biotite CCD setup...") | ||
| if not biotite.setup_ccd.OUTPUT_CCD.exists(): |
There was a problem hiding this comment.
Would we actually expect this condition to ever trigger? Biotite ships a CCD during its own setup, so I'd always expect one to exist there. Imo the problem in Pat's case isn't that there is no Biotite CCD downloaded, it is that the ligands he is working with are so recent that they won't be found in outdated CCDs, which is why one needs to trigger a manual update.
There was a problem hiding this comment.
No, actually – empirically I found that I need to run this and then tests start to pass. The installed biotite somehow comes with a different CCD file?
There was a problem hiding this comment.
Ah, no I think you're partially right actually – there is some default CCD file that installs with biotite but it's out-of-date and running the setup command updates it
There was a problem hiding this comment.
Yeah this is caused by our version constraint of biotite 1.2.0, which likely ships an outdated CCD. I can confirm that a non-constrained install of biotite doesn't run into this problem.
There was a problem hiding this comment.
It's still a general problem though, of course we can ameliorate it for some time by installing a fresh CCD on setup, but the PDB will keep adding new ligands weekly so you can quickly run into the same problem again. It might be reasonable enough to just include a helpful error message when a ligand could not be found in the CCD that suggests running the setup_ccd command.
There was a problem hiding this comment.
(and we'd need to get rid of the guard because biotite.setup_ccd.OUTPUT_CCD.exists() should always return true after biotite was installed)
There was a problem hiding this comment.
Overall this is great!
Is there a way to fetch the date / version of the CCD file? Then setup_ccd could prompt the user if they want to download a newer version when it is available.
There was a problem hiding this comment.
Negative, we can get an ETag/last modified of the remote CCD file – but it's not the same file as biotite saves, because it does some modification prior. So we can't do a simple ETag1 == ETag2.
Instead I opted for a "stale ETag list" and if the file on disk is stale, it'll re-download.
I excluded all solutions with a sidecar file, this will only make it more messy.
This is overall not great because we hard-code a md5 sum in code but until we move beyond biotite==1.2.0, I don't see a better way.
jnwei
left a comment
There was a problem hiding this comment.
Nice addition, clean way to update the CCD automatically upon openfold3 setup.
ljarosch
left a comment
There was a problem hiding this comment.
Generally I like that we're cleaner about this now and making our tests more robust. However how would this fix this issue for users like Pat at inference time? Is it because this update function will run at setup?
Generally we will keep running into this problem, this fix will update the CCD once, but if you wait a month and try to inference a recent compound you will run into the same issue again. I think it would make sense to add a more prominent pointer to python -m biotite.setup_ccd to our docs, and also reference it in the error message when a CCD entry can't be found.
docs/source/Installation.md
Outdated
| - Setup a directory for OpenFold3 model parameters [default: `~/.openfold3`] | ||
| - Writes the path to `$OPENFOLD_CACHE/ckpt_path` | ||
| - Download the model parameters, if the parameter file does not already exist | ||
| - Download and setup the [Chemical Component Dictionary (CCD) with Biotite](https://www.biotite-python.org/latest/apidoc/biotite.structure.info.get_ccd.html) |
There was a problem hiding this comment.
Not sure if the pointer of this link is super helpful, this is just a function to work with Biotite's CCD representation in code. Why not just add python -m biotite.setup_ccd if the intent is to show users how to download and set it up?
There was a problem hiding this comment.
We could surface this as work-around but that's really our abstraction leaking. One day we might remove biotite, and the users should be none the wiser about it.
There was a problem hiding this comment.
My intention was to provide users who might be unfamiliar with the CCD some context about what it is since it is now included in our top level Installation documentation and how biotite is used in relation to the CCD.
We can certainly replace this link with a more general pointer to the CCD (perhaps this one: https://www.wwpdb.org/data/ccd) if that would be more helpful. This might make more sense if we now plan to provide the CCD file ourselves
|
You do have a point about this working "once only" – only if the bcif has the stale hash, will this ever update, and after-wards it'll just be stuck. But I don't see a way around it other than by introducing state, which I really-really don't want to do. |
|
Sorry, i know this is becoming a PR turned RFC nightmare. Here is what i think we should do, because there is a real limitation to the approach in this PR, in that the update will happen once-only. Context
Options
|
jnwei
left a comment
There was a problem hiding this comment.
Nice solution! Provides an easy way to provide an updated CCD bcif ourselves and inserting into biotite's default CCD. The only potential issue I can think of is if a structure gets added to the CCD that biotite cannot support, but that seems like a very narrow edge case.
Two requests:
- Could we add a small script / workflow for how to update the CCD file to the bucket?
- I added a suggestion to make the test for smiles / CCD comparision more robust.
docs/source/Installation.md
Outdated
| - Setup a directory for OpenFold3 model parameters [default: `~/.openfold3`] | ||
| - Writes the path to `$OPENFOLD_CACHE/ckpt_path` | ||
| - Download the model parameters, if the parameter file does not already exist | ||
| - Download and setup the [Chemical Component Dictionary (CCD) with Biotite](https://www.biotite-python.org/latest/apidoc/biotite.structure.info.get_ccd.html) |
There was a problem hiding this comment.
My intention was to provide users who might be unfamiliar with the CCD some context about what it is since it is now included in our top level Installation documentation and how biotite is used in relation to the CCD.
We can certainly replace this link with a more general pointer to the CCD (perhaps this one: https://www.wwpdb.org/data/ccd) if that would be more helpful. This might make more sense if we now plan to provide the CCD file ourselves
| struct_from_smiles = structure_with_ref_mol_from_smiles(smiles, chain_id="X") | ||
| struct_from_ccd = structure_with_ref_mol_from_ccd_code(ccd_code, chain_id="X") | ||
|
|
||
| assert len(struct_from_smiles.atom_array) == len(struct_from_ccd.atom_array) |
There was a problem hiding this comment.
Some quick suggestions to make this test more comprehensive:
-
Would it make sense to use the
custom_assert_utils.assert_atom_array_equalutility here? Or would we expect some differences in atom ordering / bond ordering so that this assert would fail. -
If
assert_atom_array_equalis not a good fit here, perhaps we could add a test to check the number of carbons or some other aggregate property.
There was a problem hiding this comment.
Ah, it'd be really nice indeed if we could use that – but it currently doesn't work, just tried empirically (atm atom names different between smiles and CCD, eg C01 vs C1) – maybe we can just match on element count?
There was a problem hiding this comment.
Yeah figures. I think the atom counter is good enough here for now, thanks for adding it.
| struct_from_smiles = structure_with_ref_mol_from_smiles(smiles, chain_id="X") | ||
| struct_from_ccd = structure_with_ref_mol_from_ccd_code(ccd_code, chain_id="X") | ||
|
|
||
| assert len(struct_from_smiles.atom_array) == len(struct_from_ccd.atom_array) |
There was a problem hiding this comment.
Yeah figures. I think the atom counter is good enough here for now, thanks for adding it.
Summary
Fixes #94
Changes
Related Issues
Testing
Other Notes